Generalization Error
Generalization of Finite hypothesis space
Generalization for fixed f
Lemma: high-probability bound for a fixed f
\[\Pr(Er_{out} - Er_{in} \ge t) \le e^{-2nt^2}\]
- proof step 1: show that $E(Er_{in}) = Er_{out}$
- proof step 2: apply Hoeffding's inequality, with the $X_i$ independent random variables with mean $u_i$ and bounded on $[a_i, b_i]$ (specialized below)
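For reference, here is the Hoeffding statement being applied, specialized to the 0-1 loss (the identification $X_i = \mathbf{1}[f(x_i) \ne y_i]$ is left implicit in the notes):
\[\Pr\left(E(\bar{X}) - \bar{X} \ge t\right) \le \exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right), \qquad \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i\]
with $a_i = 0$, $b_i = 1$, $\bar{X} = Er_{in}$ and $E(\bar{X}) = Er_{out}$, the denominator equals $n$ and the right-hand side becomes exactly $e^{-2nt^2}$.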
Proposition: generalization bound for a fixed f
the following bound holds with probability $1-\delta$:
\[Er_{out} \le Er_{in} + \sqrt{\frac{\log(1/\delta)}{2n}}\]
- proof: apply the above lemma and set $\delta = e^{-2nt^2}$; solving for $t$ yields the generalization bound (worked out below)
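Spelling out the last step (a minimal derivation, using nothing beyond the lemma above):
\[\delta = e^{-2nt^2} \;\Rightarrow\; t = \sqrt{\frac{\log(1/\delta)}{2n}} \;\Rightarrow\; \Pr\left(Er_{out} \le Er_{in} + \sqrt{\frac{\log(1/\delta)}{2n}}\right) \ge 1 - \delta\]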
Generalization for finite hypothesis space
Uniform bound for all possible $f$, holding with probability $1-\delta$:
\[Er_{out} \le Er_{in} + \sqrt{\frac{\log(|H|/\delta)}{2n}} \quad \forall f \in H\]
proof
- it is equivalent to showing that the probability that some $f \in H$ satisfies $Er_{out} - Er_{in} \ge \sqrt{\frac{\log(|H|/\delta)}{2n}}$ is less than $\delta$
- apply the union bound (very loose: it simply multiplies the single-$f$ probability by $|H|$) together with the generalization bound for a fixed $f$; this gives probability at most $|H|e^{-2nt^2}$
- setting $|H|e^{-2nt^2} = \delta$ yields the desired result (written out as one chain below)
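The same argument written as one chain (a restatement of the three bullets above, nothing new is assumed):
\[\Pr\left(\exists f \in H:\ Er_{out} - Er_{in} \ge t\right) \le \sum_{f \in H} \Pr\left(Er_{out} - Er_{in} \ge t\right) \le |H|\,e^{-2nt^2} = \delta \;\Rightarrow\; t = \sqrt{\frac{\log(|H|/\delta)}{2n}}\]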
comment
- moving from a single $f$ to a finite space H, the error term changes from $\sqrt{\frac{\log(1/\delta)}{2n}}$ to $\sqrt{\frac{\log(|H|/\delta)}{2n}}$
- the generalization error bound grows with $|H|$, but only logarithmically (see the numerical sketch below)
- H is usually infinite, and the union bound can be very coarse
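A small numerical sketch of the two points above (the helper `finite_h_bound` and the chosen values of $n$ and $\delta$ are mine for illustration, not from the notes): it evaluates the error term for growing $|H|$ to show the logarithmic growth.

```python
import numpy as np

def finite_h_bound(n, h_size, delta=0.05):
    """Error term sqrt(log(|H|/delta) / (2n)) of the finite-H generalization bound."""
    return np.sqrt(np.log(h_size / delta) / (2 * n))

n = 1000
for h_size in [1, 10, 10**3, 10**6, 10**9]:
    print(f"|H| = {h_size:>10d}:  Er_out <= Er_in + {finite_h_bound(n, h_size):.3f}")
```

Going from a single hypothesis to a billion increases the error term by less than a factor of three here, which is the "only logarithmically" point; the catch is that the argument gives nothing once $|H|$ is infinite.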
Generalization of Infinite hypothesis space
replace $|H|$ with the VC dimension
we only discuss binary classification here
Growth function
- Dichotomy: a pattern of labels assigned to a finite sample of $n$ points
- Dichotomies of H: the set of distinct dichotomies produced by all $f \in H$ on those points (many different $f$ can produce the same labels, so this set can be much smaller than H)
- growth function $G(n)$: over all possible choices of $n$ sample points, the maximum number of dichotomies of H; that maximum is $G(n)$ (see the brute-force sketch below)
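A brute-force sketch of these definitions for a concrete class that is not part of the notes (1-D "positive ray" classifiers $h(x) = \mathrm{sign}(x - a)$, chosen only because the enumeration is easy): it counts the distinct dichotomies on random $n$-point samples and takes the maximum, which approximates $G(n)$ up to the randomness of the sampled points.

```python
import numpy as np

def dichotomies_positive_rays(points):
    """Distinct label patterns produced by h(x) = sign(x - a) on the given points."""
    xs = np.sort(points)
    # One representative threshold per "slot": below all points, between neighbours, above all.
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    return {tuple(np.where(points > a, 1, -1)) for a in thresholds}

def growth_function_estimate(n, trials=50, seed=0):
    """Max number of dichotomies over several random n-point samples (a lower bound on G(n))."""
    rng = np.random.default_rng(seed)
    return max(len(dichotomies_positive_rays(rng.normal(size=n))) for _ in range(trials))

for n in range(1, 7):
    print(n, growth_function_estimate(n))   # prints n + 1: for positive rays, G(n) = n + 1
```

Note how far $n + 1$ is below $2^n$: this class is too simple to realize most dichotomies.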
comment
- the growth function measures the richness of H based on $n$ points rather than on the cardinality of H
- it is impractical to compute $G(n)$ exactly for every $n$
- we know $2^n$ is an upper bound for $G(n)$, but it is trivial: after taking the log, the error term becomes a constant, so larger $n$ would not bring better generalization (see the short calculation below)
- can we find a tighter bound? Yes: the VC dimension
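The arithmetic behind the "trivial" remark, plugging $G(n) = 2^n$ into the same square-root term:
\[\sqrt{\frac{\log(2^n/\delta)}{2n}} = \sqrt{\frac{n\log 2 + \log(1/\delta)}{2n}} \;\longrightarrow\; \sqrt{\frac{\log 2}{2}} \approx 0.59 \qquad (n \to \infty)\]
so with the trivial bound the error term never shrinks to zero, no matter how much data we collect.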
break point and VC dimension
- shatter: H shatters a set of $n$ points if every one of the $2^n$ possible label patterns (dichotomies) can be realized by some $f \in H$
- break point: a sample size $k$ such that no set of $k$ points can be shattered
- VC dimension: the largest sample size that can be shattered (see the brute-force check below)
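A brute-force check of these three definitions on another illustrative class not taken from the notes (1-D interval classifiers, label +1 inside $[a, b]$): it tests whether all $2^n$ label patterns on a point set are realizable.

```python
import numpy as np

def interval_dichotomies(points):
    """Label patterns realizable by h(x) = +1 if a <= x <= b else -1."""
    xs = np.sort(points)
    # One representative endpoint per "slot": before, between, and after the sorted points.
    slots = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    return {tuple(np.where((points >= a) & (points <= b), 1, -1))
            for a in slots for b in slots if a <= b}

def is_shattered(points):
    return len(interval_dichotomies(points)) == 2 ** len(points)

print(is_shattered(np.array([0.0, 1.0])))        # True:  2 points can be shattered
print(is_shattered(np.array([0.0, 1.0, 2.0])))   # False: pattern (+,-,+) is impossible,
                                                 # so 3 is a break point and d_vc = 2
```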
Theorem: for a d-dimensional binary linear classifier, we have $d_{vc} = d + 1$
- it is exactly the effective number of parameters, and hence a measure of the complexity of H (see the check below)
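A small empirical check of the theorem for $d = 2$ (so the claim is $d_{vc} = 2 + 1 = 3$); the LP feasibility test for linear separability via `scipy.optimize.linprog`, and the particular point sets, are my choices for illustration:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Feasibility of y_i * (w . x_i + b) >= 1 for all i, checked as a linear program."""
    n, d = points.shape
    # Variables [w_1, ..., w_d, b]; constraints -y_i * (x_i . w + b) <= -1.
    A_ub = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered_by_linear(points):
    return all(separable(points, np.array(labels, dtype=float))
               for labels in itertools.product([-1, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # in general position
four  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered_by_linear(three))  # True: 3 points are shattered
print(shattered_by_linear(four))   # False for this set; in fact no 4 points can be shattered,
                                   # consistent with d_vc = 2 + 1 = 3
```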
VC generalization bound
with probability $1-\delta$, the deviation $Er_{out} - Er_{in}$ is, ignoring logarithmic factors, of order
\[O\left(\sqrt{\frac{d_{vc}}{n}}\right)\]
comment
- it is a little loose, but it establishes the feasibility of learning for an infinite H
- once n is large enough, learning is likely to be feasible
- it gives us some rules of thumb for when the bound is meaningful, e.g. for the number of samples: $n \ge 10\,d_{vc}$ (e.g. for linear regression); see the sketch at the end of this section
- [ ] todo
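A sketch relating the bound to the $n \ge 10\,d_{vc}$ rule of thumb. The explicit constants below follow one commonly quoted form of the VC bound, $\sqrt{\tfrac{8}{n}\log\tfrac{4((2n)^{d_{vc}}+1)}{\delta}}$ (as in *Learning From Data*); treat those constants, and the helper names, as assumptions of this sketch rather than as part of the notes.

```python
import numpy as np

def vc_epsilon(n, dvc, delta=0.05):
    """Deviation term sqrt(8/n * log(4 * (2n)^dvc / delta)); the "+1" in the
    growth-function bound is dropped, which is negligible at these sizes."""
    log_term = np.log(4.0) + dvc * np.log(2.0 * n) - np.log(delta)
    return np.sqrt(8.0 / n * log_term)

def samples_needed(dvc, target=0.1, delta=0.05):
    """Smallest n (up to doubling granularity) with vc_epsilon(n, dvc) <= target."""
    n = 10
    while vc_epsilon(n, dvc, delta) > target:
        n *= 2
    return n

for dvc in [3, 5, 10, 20]:
    n = samples_needed(dvc)
    print(f"dvc = {dvc:2d}:  n ~ {n:>7d}  (n / dvc ~ {n / dvc:,.0f})")
```

The formal bound demands n in the tens of thousands even for small $d_{vc}$, i.e. it is loose, as noted above; the useful takeaway is that the required n grows roughly linearly in $d_{vc}$, which is where practical rules like $n \ge 10\,d_{vc}$ come from.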